feat(backend): Q5_K packed matmul + ARM NEON + Kotlin/Native cinterop path #734
Merged
Adds Q5_K as a packed in-kernel dequant-matmul format (previously Q5_K was only eagerly decoded to FP32 on load), mirroring the existing Q4_K plumbing, and hand-written ARM NEON paths for the native CPU kernels.

Q5_K (256-elt / 176-byte super-block: d, dMin, 12 packed scales, 32-byte qh high-bit plane, 128-byte qs low nibbles; 5-bit code = lowNibble | (5th<<4)):
- TensorEncoding.Q5_K; Q5_KTensorData / Q5_KBlockTensorData (5th-bit fold).
- Q5KMatmulKernel SPI + matmulQ5K()/"Q5_K" in KernelProvider.supports().
- ScalarQ5_KMatmulKernel (commonMain/KN), PanamaVectorQ5_KMatmulKernel (JVM), native C skainet_q5k_matmul + NativeQ5KMatmulKernel (FFM); all registered.
- DefaultCpuOps matmul dispatch + lazy-transpose branches.
- StreamingGgufParametersLoader: Q5_K + Q6_K packed branches (a Q5_K_M GGUF now loads end-to-end instead of SKIP'ing most tensors).

Tests: Q5_KBlockTensorData bit-exact vs DequantOps golden across blocks; native<->Panama<->scalar matmul parity; KernelSupportMatrixTest gate updated.

ARM NEON (behind #if __ARM_NEON in skainet_simd.h; x86 keeps the scalar fallback, re-verified green):
- fp32 (broadcast+vfmaq), q8_0 (widen int8->f32+vfmaq), q4k/q5k (nibble unpack + dual code/input accumulators; q5k folds the qh 5th bit via a runtime-count vshlq_u8).
- CMake aarch64 branch: -march=armv8.2-a+fp16+dotprod (no +i8mm, since the A55 lacks it). Cross toolchain-aarch64.cmake + opt-in -PcrossArm64 gradle tasks; default x86 build unaffected.

BOARD-VERIFY-PENDING: the NEON paths are aarch64-syntax-validated (clang --target=aarch64) but not executed (x86 host, no QEMU). Run the parity tests under qemu-aarch64 or on the SL2610 before relying on them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
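The Q5_K super-block layout and the `lowNibble | (5th<<4)` fold described in this commit can be sketched in C as below. This is an illustration, not the PR's code: the struct and function names (`q5k_block`, `q5k_code`) are hypothetical, `d` and `dMin` are assumed to be 2-byte fp16 values (the only split that makes the stated fields sum to 176 bytes), and the element-to-bit ordering is assumed to follow the ggml convention, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* One Q5_K super-block covering 256 elements: 2 + 2 + 12 + 32 + 128 = 176 bytes. */
typedef struct {
    uint16_t d;          /* super-block scale, raw fp16 bits (assumed width) */
    uint16_t d_min;      /* super-block min, raw fp16 bits (assumed width) */
    uint8_t  scales[12]; /* packed sub-block scales and mins */
    uint8_t  qh[32];     /* high-bit plane: the 5th bit of every element */
    uint8_t  qs[128];    /* low nibbles, two elements per byte */
} q5k_block;

static_assert(sizeof(q5k_block) == 176, "Q5_K super-block must be 176 bytes");

/* Returns the 5-bit code (0..31) of element i in the block.
 * Assumed ordering (ggml-style): each group of 64 elements owns 32 qs bytes,
 * the first 32 elements use the low nibbles and the next 32 the high nibbles;
 * the group also owns two consecutive bit positions in every qh byte. */
static inline unsigned q5k_code(const q5k_block *b, unsigned i) {
    unsigned group  = i / 64;      /* 0..3 */
    unsigned within = i % 64;
    unsigned lane   = within % 32; /* byte index inside the group */
    unsigned upper  = within / 32; /* 0 = low nibble, 1 = high nibble */
    uint8_t  byte   = b->qs[group * 32 + lane];
    unsigned low_nibble = upper ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0F);
    unsigned fifth = ((unsigned)b->qh[lane] >> (2 * group + upper)) & 1u;
    return low_nibble | (fifth << 4); /* 5-bit code = lowNibble | (5th << 4) */
}
```

Dequantization then scales each code by the sub-block scale and subtracts the sub-block min; unpacking those from the 12 `scales` bytes is omitted here.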
The hand-written matmul kernels were JVM-only (consumed via FFM), but the SL2610 board binary is Kotlin/Native, so it can't use the FFM wrapper. Add a K/N consumption path via cinterop so the board gets the same C (and, on aarch64, NEON) kernels.

- CMake builds a STATIC archive (skainet_kernels_static -> libskainet_kernels.a) alongside the SHARED lib; same sources + flags (incl. the aarch64 NEON march).
- cinterop .def (skainet_kernels.h -> sk.ainet.kernels.cinterop).
- linuxX64 target on the (previously jvm-only) module, linking the static archive into K/N binaries; link tasks depend on the CMake build.
- NativeKnQ5KMatmulKernel (linuxX64Main): calls skainet_q5k_matmul via cinterop with pinned arrays (zero-copy).

POC verified on the host (linuxX64): in NativeKnQ5KMatmulKernelParityTest the cinterop kernel matches the commonMain ScalarQ5_KMatmulKernel across 4 shapes (tests=4, failures=0). JVM/FFM path unchanged (jvmTest green). linuxArm64 board target + NEON runtime check are the remaining step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The K/N analogue of the JVM NativeKernelProvider (FFM): a KernelProvider (priority 100) exposing the cinterop-backed Q5_K/Q4_K/Q8_0/Q4_0 matmul kernels, plus installNativeKernels() to register it in KernelRegistry. This is the path the eager runtime's DefaultCpuOps.chooseQuantizedMatmulHeap uses to resolve a kernel. K/N has no ServiceLoader, so registration is an explicit call by the consumer (scalar fallback for Q6_K etc. is registered separately from skainet-backend-cpu).

Verified on linuxX64 with NativeKnKernelProviderTest: installNativeKernels makes native-cinterop the best-available provider, its Q5_K kernel is the registry-resolved kernel, and it matches the scalar reference (6 K/N tests green total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
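The resolution rule described in this commit (explicit registration, then pick the highest-priority provider that supports the encoding) can be sketched as follows. The sketch is written in C to stay language-neutral and is not the Kotlin `KernelRegistry` API: every type and function name here (`kernel_provider`, `registry_install`, `registry_best`) is hypothetical, and the fixed-size table is only a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider record; field names are illustrative only. */
typedef struct {
    const char *name;
    int priority;                 /* higher wins */
    const char *const *encodings; /* NULL-terminated list of supported encodings */
} kernel_provider;

#define MAX_PROVIDERS 8
static const kernel_provider *g_providers[MAX_PROVIDERS];
static size_t g_provider_count = 0;

/* Explicit registration: nothing is discovered automatically,
 * mirroring the absence of ServiceLoader on Kotlin/Native. */
static int registry_install(const kernel_provider *p) {
    assert(p != NULL);
    if (g_provider_count == MAX_PROVIDERS) return -1;
    g_providers[g_provider_count++] = p;
    return 0;
}

static int provider_supports(const kernel_provider *p, const char *encoding) {
    for (const char *const *e = p->encodings; *e != NULL; ++e)
        if (strcmp(*e, encoding) == 0) return 1;
    return 0;
}

/* Returns the highest-priority installed provider that supports the
 * encoding, or NULL when no installed provider does. */
static const kernel_provider *registry_best(const char *encoding) {
    const kernel_provider *best = NULL;
    for (size_t i = 0; i < g_provider_count; ++i) {
        const kernel_provider *p = g_providers[i];
        if (provider_supports(p, encoding) &&
            (best == NULL || p->priority > best->priority))
            best = p;
    }
    return best;
}
```

Under this rule a priority-100 provider that lists Q5_K wins for Q5_K as soon as it is installed, while an encoding it does not list (Q6_K in the commit message) still resolves to whichever lower-priority provider supports it.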
Promote the K/N cinterop path from the linuxX64 POC to the real board target:

- linuxArm64 target with the same skainet_kernels cinterop; links the aarch64 cross-built static archive (cmake-build-arm64/libskainet_kernels.a, NEON).
- Shared `nativeMain` source set holds NativeKn*MatmulKernel + the provider, so linuxX64 and linuxArm64 share one implementation (cinterop bindings are commonized across both targets).
- linuxArm64 link tasks depend on the aarch64 cross-build only under -PcrossArm64 (toolchain present); a plain host build still compiles linuxArm64 to a klib.

Verified on host: compileKotlinLinuxArm64 + cinteropSkainetKernelsLinuxArm64 succeed (cross-compiled from x86); linuxX64Test still green (6 tests) on the shared nativeMain. Final aarch64 binary link + NEON runtime are board-verify-pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a first-class Q5_K packed in-kernel dequant-matmul to the CPU backend (it was previously only eagerly decoded to FP32), hand-written ARM NEON kernels, and a Kotlin/Native cinterop consumption path so the kernels run on the board binary (not just the JVM via FFM).
What's here
- Q5_K packed matmul: `TensorEncoding.Q5_K`, `Q5_KTensorData` / `Q5_KBlockTensorData` (5th-bit fold from `qh`), `Q5KMatmulKernel` SPI, scalar (commonMain) + Panama (JVM) + native-C kernels, `DefaultCpuOps` dispatch + lazy transpose, and a `StreamingGgufParametersLoader` Q5_K/Q6_K packed branch.
- ARM NEON (behind `#if __ARM_NEON`; x86 keeps the scalar fallback): fp32, q8_0, q4k, q5k. CMake aarch64 branch `-march=armv8.2-a+fp16+dotprod` (no `+i8mm`, since the A55 lacks it). Cross toolchain + opt-in `-PcrossArm64`.
- Kotlin/Native cinterop: static `libskainet_kernels.a`; `linuxX64` + `linuxArm64` targets with a shared `nativeMain`; `NativeKn*MatmulKernel` + `NativeKnKernelProvider` (+ `installNativeKernels()`) so K/N resolves the C kernels through `KernelRegistry`.

Verification
- `Q5_KBlockTensorData` bit-exact vs `DequantOps` golden across blocks; native↔Panama↔scalar matmul parity; capability-matrix gate updated.
- K/N tests green on `linuxX64` (6 tests); `compileKotlinLinuxArm64` + cinterop cross-compile from x86.

Board-verify-pending
The NEON paths are aarch64-syntax-validated (`clang --target=aarch64`) but not executed (x86 host, no QEMU). The final aarch64 binary link + NEON runtime parity need the SL2610 (or QEMU).

🤖 Generated with Claude Code
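The pending board check boils down to running the NEON path and the scalar path on the same inputs and comparing the results. A minimal sketch of that guard-plus-parity pattern is shown below for a plain fp32 dot product. It is not one of the PR's kernels: the function names are hypothetical, the tolerance is an arbitrary choice, and the guard additionally tests `__aarch64__` because `vaddvq_f32` is an AArch64-only intrinsic.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference: always compiled, used as the parity baseline. */
static float dot_f32_scalar(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

/* Dispatching kernel: NEON on aarch64, scalar fallback everywhere else. */
static float dot_f32(const float *a, const float *b, size_t n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)                 /* 4 lanes per fused multiply-add */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float sum = vaddvq_f32(acc);               /* horizontal add of the 4 lanes */
    for (; i < n; ++i) sum += a[i] * b[i];     /* scalar tail */
    return sum;
#else
    return dot_f32_scalar(a, b, n);
#endif
}

static float abs_f32(float x) { return x < 0.0f ? -x : x; }

/* Returns 1 when the dispatched kernel agrees with the scalar reference
 * within a relative tolerance; NEON sums in a different order, so exact
 * equality is not expected on aarch64. */
static int dot_f32_parity(const float *a, const float *b, size_t n, float rel_tol) {
    assert(rel_tol >= 0.0f);
    float ref = dot_f32_scalar(a, b, n);
    float got = dot_f32(a, b, n);
    float scale = abs_f32(ref) > 1.0f ? abs_f32(ref) : 1.0f;
    return abs_f32(got - ref) <= rel_tol * scale;
}
```

On an x86 host the check passes trivially because both calls take the scalar path, which is why a green host run says nothing about the NEON code; the comparison only becomes meaningful under qemu-aarch64 or on the board.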